Computation and Language 41
☆ Computational Language Acquisition with Theory of Mind ICLR 2023
Unlike current state-of-the-art language models, young children actively
acquire language through interactions with their surrounding environment and
caretakers. One mechanism that has been argued to be critical to language
learning is the ability to infer the mental states of other agents in social
environments, coined Theory of Mind (ToM) by Premack & Woodruff (1978). Drawing
inspiration from the modern operationalized versions of ToM implemented in
Rabinowitz et al. (2018) and Zhu et al. (2021), we build language-learning
agents equipped with ToM, and measure its effects on the learning process. We
model ToM by giving the speaker agent an internal listener model that is
trained alongside the speaker and used to rerank potential utterances. We
experiment with varying task difficulty, hypothesizing that models will acquire
more complex language to adapt to stronger environmental pressures. We find
that training speakers with a highly weighted ToM listener component leads to
performance gains in our image referential game setting. We also find some
evidence that increasing task difficulty in the training process results in
more fluent and precise utterances in evaluation. This suggests the potential
utility of further incorporating ToM, as well as other insights from child
language acquisition, into computational models of language acquisition.
comment: 9 pages, 3 figures. To be published in the 11th International
Conference on Learning Representations, ICLR 2023, Conference Track
Proceedings
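The speaker-with-internal-listener reranking described above can be sketched minimally: the speaker scores each candidate utterance, an internal listener scores how likely the correct referent is recovered, and the two scores are mixed by a ToM weight. All names and the toy scorers below are hypothetical stand-ins, not the paper's trained models.

```python
# Hedged sketch: a speaker agent reranks candidate utterances with an
# internal listener model (Theory of Mind). The scorers are toy stand-ins.

def rerank_utterances(candidates, listener_score, speaker_score, tom_weight=0.8):
    """Rank utterances by a mix of speaker fluency and listener success.

    listener_score(u): probability the listener resolves the correct referent.
    speaker_score(u):  the speaker model's own preference for the utterance.
    tom_weight:        weight on the Theory-of-Mind (listener) component.
    """
    def combined(u):
        return (1 - tom_weight) * speaker_score(u) + tom_weight * listener_score(u)
    return sorted(candidates, key=combined, reverse=True)

# Toy scorers: the speaker prefers the shorter phrase, but the listener
# resolves the referent far more reliably from the informative one.
listener = {"the red square": 0.9, "the square": 0.4}.get
speaker = {"the red square": 0.5, "the square": 0.8}.get

best = rerank_utterances(["the square", "the red square"], listener, speaker)[0]
print(best)  # the red square
```

With a highly weighted listener component, the more informative utterance wins the reranking even though the speaker model alone prefers the shorter one.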
☆ Language Variety Identification with True Labels
Marcos Zampieri, Kai North, Tommi Jauhiainen, Mariano Felice, Neha Kumari, Nishant Nair, Yash Bangera
Language identification is an important first step in many IR and NLP
applications. Most publicly available language identification datasets,
however, are compiled under the assumption that the gold label of each instance
is determined by where texts are retrieved from. Research has shown that this
is a problematic assumption, particularly in the case of very similar languages
(e.g., Croatian and Serbian) and national language varieties (e.g., Brazilian
and European Portuguese), where texts may contain no distinctive marker of the
particular language or variety. To overcome this important limitation, this
paper presents DSL True Labels (DSL-TL), the first human-annotated multilingual
dataset for language variety identification. DSL-TL contains a total of 12,900
instances in Portuguese, split between European Portuguese and Brazilian
Portuguese; Spanish, split between Argentine Spanish and Castilian Spanish; and
English, split between American English and British English. We trained
multiple models to discriminate between these language varieties, and we
present the results in detail. The data and models presented in this paper
provide a reliable benchmark toward the development of robust and fairer
language variety identification systems. We make DSL-TL freely available to the
research community.
☆ WiCE: Real-World Entailment for Claims in Wikipedia
Models for textual entailment have increasingly been applied to settings like
fact-checking, presupposition verification in question answering, and
validating that generation models' outputs are faithful to a source. However,
such applications are quite far from the settings that existing datasets are
constructed in. We propose WiCE, a new textual entailment dataset centered
around verifying claims in text, built on real-world claims and evidence in
Wikipedia with fine-grained annotations. We collect sentences in Wikipedia that
cite one or more webpages and annotate whether the content on those pages
entails those sentences. Negative examples arise naturally, ranging from slight
misinterpretations of text to minor aspects of the sentence that are not
attested in the evidence. Our annotations are over sub-sentence units of the
hypothesis, decomposed automatically by GPT-3, each of which is labeled with a
subset of evidence sentences from the source document. We show that real claims
in our dataset involve challenging verification problems, and we benchmark
existing approaches on this dataset. In addition, we show that reducing the
complexity of claims by decomposing them with GPT-3 can improve entailment
models' performance across various domains.
☆ Semiparametric Language Models Are Scalable Continual Learners
Semiparametric language models (LMs) have shown promise in continuously
learning from new text data by combining a parameterized neural LM with a
growable non-parametric memory for memorizing new content. However,
conventional semiparametric LMs eventually become prohibitively expensive to
compute and store when applied to continual learning over streaming data,
because the non-parametric memory grows linearly with the amount of data they
learn from over time. To address this scalability issue, we present a simple
and intuitive approach called Selective Memorization (SeMem), which only
memorizes difficult samples that the model is likely to struggle with. We
demonstrate that SeMem improves the scalability of semiparametric LMs for
continual learning over streaming data in two ways: (1) data-wise scalability:
as the model becomes stronger through continual learning, it will encounter
fewer difficult cases that need to be memorized, causing the growth of the
non-parametric memory to slow down over time rather than growing at a linear
rate with the size of training data; (2) model-wise scalability: SeMem allows a
larger model to memorize fewer samples than its smaller counterpart because it
is rarer for a larger model to encounter incomprehensible cases, resulting in a
non-parametric memory that does not scale linearly with model size. We conduct
extensive experiments on language modeling and downstream tasks to evaluate
SeMem, showing that it enables a semiparametric LM to act as a scalable
continual learner with little forgetting.
comment: Work in progress
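The selective-memorization idea lends itself to a short sketch: only samples whose loss exceeds a difficulty threshold are written to the non-parametric memory, so a stronger model writes less. The loss values and the threshold below are illustrative assumptions, not values from the paper.

```python
# Hedged sketch of Selective Memorization (SeMem): only write samples the
# model struggles with (high loss) into the non-parametric memory.

def selective_memorize(memory, samples, loss_fn, threshold=2.0):
    """Append only 'difficult' samples (loss above threshold) to memory."""
    for sample in samples:
        if loss_fn(sample) > threshold:
            memory.append(sample)
    return memory

# Toy stand-in: pretend per-sample LM losses are precomputed.
losses = {"easy sentence": 0.5, "rare jargon-heavy sentence": 3.1}
memory = selective_memorize([], list(losses), losses.get, threshold=2.0)
print(memory)  # only the difficult sample is memorized
```

As the model improves over the stream, fewer samples cross the threshold, which is exactly the data-wise scalability argument in the abstract.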
☆ NLP Workbench: Efficient and Extensible Integration of State-of-the-art Text Mining Tools EACL 2023
NLP Workbench is a web-based platform for text mining that allows non-expert
users to obtain semantic understanding of large-scale corpora using
state-of-the-art text mining models. The platform is built upon the latest
pre-trained models and open-source systems from academia that provide semantic
analysis functionalities, including but not limited to entity linking,
sentiment analysis, semantic parsing, and relation extraction. Its extensible
design enables researchers and developers to smoothly replace an existing model
or integrate a new one. To improve efficiency, we employ a microservice
architecture that facilitates allocation of acceleration hardware and
parallelization of computation. This paper presents the architecture of NLP
Workbench and discusses the challenges we faced in designing it. We also
discuss diverse use cases of NLP Workbench and the benefits of using it over
other approaches. The platform is under active development, with its source
code released under the MIT license. A website and a short video demonstrating
our platform are also available.
comment: Camera-ready version for EACL 2023: System Demonstrations
☆ MLANet: Multi-Level Attention Network with Sub-instruction for Continuous Vision-and-Language Navigation
Vision-and-Language Navigation (VLN) aims to develop intelligent agents to
navigate in unseen environments only through language and vision supervision.
In the recently proposed continuous settings (continuous VLN), the agent must
act in a free 3D space and faces tougher challenges like real-time execution,
complex instruction understanding, and long action sequence prediction. For a
better performance in continuous VLN, we design a multi-level instruction
understanding procedure and propose a novel model, Multi-Level Attention
Network (MLANet). The first step of MLANet is to generate sub-instructions
efficiently. We design a Fast Sub-instruction Algorithm (FSA) to segment the
raw instruction into sub-instructions and generate a new sub-instruction
dataset named ``FSASub". FSA is annotation-free and 70 times faster than the
current method, thus meeting the real-time requirement in continuous VLN.
To solve the complex instruction understanding problem, MLANet needs a global
perception of the instruction and observations. We propose a Multi-Level
Attention (MLA) module to fuse vision, low-level semantics, and high-level
semantics, which produces features containing a dynamic and global comprehension
of the task. MLA also mitigates the adverse effects of noise words, thus
ensuring a robust understanding of the instruction. To correctly predict
actions in long trajectories, MLANet needs to focus on which sub-instruction is
being executed at each step. We propose a Peak Attention Loss (PAL) to improve
the flexible and adaptive selection of the current sub-instruction. PAL
benefits the navigation agent by concentrating its attention on the local
information, thus helping the agent predict the most appropriate actions. We
train and test MLANet on the standard benchmark. Experimental results show that
MLANet outperforms baselines by a significant margin.
☆ Letz Translate: Low-Resource Machine Translation for Luxembourgish
Natural language processing of Low-Resource Languages (LRL) is often
challenged by the lack of data. Therefore, achieving accurate machine
translation (MT) in a low-resource environment is a real problem that requires
practical solutions. Research on multilingual models has shown that some LRLs
can be handled by such models. However, their large size and computational
needs make their use in constrained environments (e.g., mobile/IoT devices or
limited/old servers) impractical. In this paper, we address this problem by
leveraging the power of large multilingual MT models using knowledge
distillation. Knowledge distillation can transfer knowledge from a large and
complex teacher model to a simpler and smaller student model without losing
much in performance. We also make use of high-resource languages that are
related or share the same linguistic root as the target LRL. For our
evaluation, we consider Luxembourgish as the LRL that shares some roots and
properties with German. We build multiple resource-efficient models based on
German, knowledge distillation from the multilingual No Language Left Behind
(NLLB) model, and pseudo-translation. We find that our efficient models are
more than 30% faster and perform only 4% worse than the large
state-of-the-art NLLB model.
comment: The associated model is published on HuggingFace:
https://huggingface.co/etamin/Letz-Translate-OPUS-LB-EN The dictionary used
in this paper is available on GitHub:
https://github.com/Etamin/Ltz_dictionary
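The knowledge-distillation objective the abstract relies on can be illustrated with a generic soft-target KL loss between teacher and student distributions; this is the textbook formulation, not the exact NLLB-to-student training recipe used in the paper, and the logits below are toy values.

```python
# Hedged sketch of a knowledge-distillation loss: KL divergence between
# temperature-softened teacher and student distributions, scaled by T^2.
import math

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) over softened distributions."""
    def softmax(logits, T):
        exps = [math.exp(l / T) for l in logits]
        s = sum(exps)
        return [e / s for e in exps]
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return temperature ** 2 * kl

# Identical logits give zero loss; diverging logits give a positive loss.
print(distillation_loss([1.0, 2.0], [1.0, 2.0]))  # 0.0
```

In practice this soft loss is usually combined with the ordinary cross-entropy on gold translations, letting the small student mimic the large teacher's output distribution.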
☆ Matching-based Term Semantics Pre-training for Spoken Patient Query Understanding ICASSP 2023
The Medical Slot Filling (MSF) task aims to convert medical queries into
structured information, playing an essential role in diagnosis dialogue
systems. However, the lack of sufficient term-semantics learning makes it hard
for existing approaches to capture semantically identical but colloquial
expressions of terms in medical conversations. In this work, we formalize MSF as a matching
problem and propose a Term Semantics Pre-trained Matching Network (TSPMN) that
takes both terms and queries as input to model their semantic interaction. To
learn term semantics better, we further design two self-supervised objectives,
including Contrastive Term Discrimination (CTD) and Matching-based Mask Term
Modeling (MMTM). For each given term, CTD determines whether it is the masked
term in the dialogue, while MMTM directly predicts the masked terms. Experimental
results on two Chinese benchmarks show that TSPMN outperforms strong baselines,
especially in few-shot settings.
comment: ICASSP 2023
☆ Synthetic Misinformers: Generating and Combating Multimodal Misinformation
With the expansion of social media and the increasing dissemination of
multimedia content, the spread of misinformation has become a major concern.
This necessitates effective strategies for multimodal misinformation detection
(MMD) that detect whether the combination of an image and its accompanying text
could mislead or misinform. Due to the data-intensive nature of deep neural
networks and the labor-intensive process of manual annotation, researchers have
been exploring various methods for automatically generating synthetic
multimodal misinformation - which we refer to as Synthetic Misinformers - in
order to train MMD models. However, limited evaluation on real-world
misinformation and a lack of comparisons with other Synthetic Misinformers
make it difficult to assess progress in the field. To address this, we perform a
comparative study on existing and new Synthetic Misinformers that involves (1)
out-of-context (OOC) image-caption pairs, (2) cross-modal named entity
inconsistency (NEI), and (3) hybrid approaches, and we evaluate them
against real-world misinformation using the COSMOS benchmark. The comparative
study showed that our proposed CLIP-based Named Entity Swapping can lead to MMD
models that surpass other OOC and NEI Misinformers in terms of multimodal
accuracy and that hybrid approaches can lead to even higher detection accuracy.
Nevertheless, after alleviating information leakage from the COSMOS evaluation
protocol, low Sensitivity scores indicate that the task is significantly more
challenging than previous studies suggested. Finally, our findings showed that
NEI-based Synthetic Misinformers tend to suffer from a unimodal bias, where
text-only MMDs can outperform multimodal ones.
☆ Document Provenance and Authentication through Authorship Classification
Muhammad Tayyab Zamir, Muhammad Asif Ayub, Jebran Khan, Muhammad Jawad Ikram, Nasir Ahmad, Kashif Ahmad
Style analysis, a relatively underexplored topic, enables several
interesting applications. For instance, it allows collaborating authors to
adjust their writing styles to produce a more coherent document. Similarly,
style analysis can also be used for document provenance and authentication as a
primary step. In this paper, we propose an ensemble-based text-processing
framework for the classification of single and multi-authored documents, which
is one of the key tasks in style analysis. The proposed framework incorporates
several state-of-the-art text classification algorithms including classical
Machine Learning (ML) algorithms, transformers, and deep learning algorithms
both individually and in merit-based late fusion. For the merit-based late
fusion, we employed several weight optimization and selection methods to assign
merit-based weights to the individual text classification algorithms. We also
analyze the impact on the task of characters that are usually excluded during
pre-processing in NLP applications by conducting experiments on both clean
and uncleaned data. The proposed framework is evaluated on a large-scale
benchmark dataset, significantly improving performance over the existing
solutions.
comment: 7 pages; 3 tables; 1 figure
☆ UZH_CLyp at SemEval-2023 Task 9: Head-First Fine-Tuning and ChatGPT Data Generation for Cross-Lingual Learning in Tweet Intimacy Prediction SemEval-2023
This paper describes the submission of UZH_CLyp for the SemEval 2023 Task 9
"Multilingual Tweet Intimacy Analysis". We achieved second-best results in all
10 languages according to the official Pearson's correlation regression
evaluation measure. Our cross-lingual transfer learning approach explores the
benefits of using a Head-First Fine-Tuning method (HeFiT) that first updates
only the regression head parameters and then also updates the pre-trained
transformer encoder parameters at a reduced learning rate. Additionally, we
study the impact of using a small set of automatically generated examples (in
our case, from ChatGPT) for low-resource settings where no human-labeled data
is available. Our study shows that HeFiT stabilizes training and consistently
improves results for pre-trained models that lack domain adaptation to tweets.
Our study also shows a noticeable performance increase in cross-lingual
learning when synthetic data is used, confirming the usefulness of current text
generation systems for improving zero-shot baseline results. Finally, we examine
how possible inconsistencies in the annotated data contribute to cross-lingual
interference issues.
comment: Submitted for peer-review at SemEval-2023
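The Head-First Fine-Tuning schedule can be sketched as two stages of per-parameter-group learning rates: stage 1 trains only the regression head with the encoder frozen, and stage 2 also trains the encoder at a reduced rate. The base rate and reduction ratio below are illustrative assumptions, not the paper's hyperparameters.

```python
# Hedged sketch of Head-First Fine-Tuning (HeFiT): stage 1 updates only the
# regression head; stage 2 unfreezes the encoder at a reduced learning rate.

def hefit_schedule(stage, base_lr=1e-4, encoder_lr_ratio=0.1):
    """Per-parameter-group learning rates for the two HeFiT stages."""
    if stage == 1:
        return {"head": base_lr, "encoder": 0.0}  # stage 1: encoder frozen
    return {"head": base_lr, "encoder": base_lr * encoder_lr_ratio}

print(hefit_schedule(1))  # {'head': 0.0001, 'encoder': 0.0}
print(hefit_schedule(2))  # encoder now trains at a reduced rate
```

In a framework like PyTorch this maps naturally onto optimizer parameter groups, with the encoder group's learning rate raised from zero between stages.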
☆ Denoising-based UNMT is more robust to word-order divergence than MASS-based UNMT
We aim to investigate whether UNMT approaches with self-supervised
pre-training are robust to word-order divergence between language pairs. We
achieve this by comparing two models pre-trained with the same self-supervised
pre-training objective. The first model is trained on language pairs with
different word-orders, and the second model is trained on the same language
pairs with the source language re-ordered to match the word-order of the target
language. Ideally, UNMT approaches which are robust to word-order divergence
should exhibit no visible performance difference between the two
configurations. In this paper, we investigate two such self-supervised
pre-training based UNMT approaches, namely Masked Sequence-to-Sequence
Pre-Training (MASS), which does not include shuffling noise, and Denoising
AutoEncoder (DAE), which does.
We experiment with five English$\rightarrow$Indic language pairs (i.e.,
en-hi, en-bn, en-gu, en-kn, and en-ta), where the word-order of the source
language is SVO (Subject-Verb-Object) and the word-order of the target
languages is SOV (Subject-Object-Verb). We observe that for these language
pairs, the DAE-based UNMT approach consistently outperforms MASS in terms of
translation accuracy.
Moreover, bridging the word-order gap using reordering improves the translation
accuracy of MASS-based UNMT models but not of DAE-based UNMT models. This
observation indicates that DAE-based UNMT is more robust to word-order
divergence than MASS-based UNMT. The word-shuffling noise in the DAE approach
is the likely reason for this robustness.
☆ CTRLStruct: Dialogue Structure Learning for Open-Domain Response Generation
Dialogue structure discovery is essential in dialogue generation.
Well-structured topic flow can leverage background information and predict
future topics to help generate controllable and explainable responses. However,
most previous work has focused on dialogue structure learning in task-oriented
dialogue rather than open-domain dialogue, which is more complicated and
challenging. In this paper, we present CTRLStruct, a new framework for dialogue
structure learning to effectively explore topic-level dialogue clusters as well
as their transitions using unlabelled data. Specifically, dialogue
utterances encoded by a bidirectional Transformer are further trained through a
specially designed contrastive learning task to improve their representations.
We then cluster the utterance-level representations into topic-level clusters,
which serve as vertices in the dialogue structure graph. The edges of the
graph, indicating transition probabilities between vertices, are calculated by
mimicking expert behavior in the datasets. Finally, the dialogue structure
graph is integrated into the dialogue model to perform controlled response
generation. Experiments on two popular open-domain dialogue datasets show that
our model generates more coherent responses than strong baseline dialogue
models and outperforms typical sentence embedding methods in dialogue
utterance representation. Code is available on GitHub.
comment: 12 pages, to be published in The Web Conference 2023
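The transition edges of a dialogue-structure graph can be sketched as empirical transition probabilities between consecutive topic clusters observed in the data; the integer cluster ids below are toy stand-ins for the paper's learned clusters.

```python
# Hedged sketch of building graph edges: count transitions between
# consecutive topic-cluster assignments and normalize per source vertex.
from collections import Counter, defaultdict

def transition_probabilities(dialogues):
    """dialogues: list of per-dialogue topic-cluster-id sequences."""
    counts = defaultdict(Counter)
    for seq in dialogues:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    return {a: {b: c / sum(ctr.values()) for b, c in ctr.items()}
            for a, ctr in counts.items()}

probs = transition_probabilities([[0, 1, 2], [0, 1, 1]])
print(probs[0])  # {1: 1.0} -- cluster 0 always transitions to cluster 1
```

This is only the edge-estimation step; the clusters themselves would come from clustering contrastively trained utterance representations, as the abstract describes.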
☆ LiteG2P: A fast, light and high accuracy model for grapheme-to-phoneme conversion ICASSP2023
As a key component of automated speech recognition (ASR) and the front-end in
text-to-speech (TTS), grapheme-to-phoneme (G2P) plays the role of converting
letters to their corresponding pronunciations. Existing methods are either slow
or perform poorly, and are limited in their application scenarios, particularly
for on-device inference. In this paper, we integrate the
advantages of both expert knowledge and connectionist temporal classification
(CTC) based neural network and propose a novel method named LiteG2P which is
fast, light, and theoretically parallel. With its careful design, LiteG2P can
be deployed both in the cloud and on device. Experimental results on the
CMU dataset show that the performance of the proposed method is superior to the
state-of-the-art CTC based method with 10 times fewer parameters, and even
comparable to the state-of-the-art Transformer-based sequence-to-sequence model
with fewer parameters and 33 times less computation.
comment: Accepted by ICASSP2023
☆ Can BERT Refrain from Forgetting on Sequential Tasks? A Probing Study ICLR 2023
Large pre-trained language models achieve state-of-the-art results on a
variety of natural language processing (NLP) tasks; nevertheless, they still
suffer from forgetting when incrementally learning a sequence of tasks. To
alleviate this problem, recent works enhance existing models by sparse
experience replay and local adaption, which yield satisfactory performance.
However, in this paper we find that pre-trained language models like BERT have
a potential ability to learn sequentially, even without any sparse memory
replay. To verify the ability of BERT to maintain old knowledge, we adopt and
re-finetune single-layer probe networks with the parameters of BERT fixed. We
investigate the models on two types of NLP tasks, text classification and
extractive question answering. Our experiments reveal that BERT can retain
high-quality representations for previously learned tasks over the long
term, under extremely sparse replay or even no replay. We further introduce a
series of novel methods to interpret the mechanism of forgetting and how memory
rehearsal plays a significant role in task incremental learning, which bridges
the gap between our new discovery and previous studies about catastrophic
forgetting.
comment: Accepted by ICLR 2023. URL:
https://openreview.net/forum?id=UazgYBMS9-W
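The probing protocol can be sketched as a parameter partition: the BERT encoder stays frozen across the task sequence while a fresh single-layer probe is created per task. The parameter names and shapes below are illustrative, not BERT's real parameterization.

```python
# Hedged sketch of the probing setup: encoder parameters are frozen and
# only a new single-layer probe is trainable for each incoming task.

def probe_setup(params, task_id, probe_dim=4):
    """Split params into a frozen encoder set and a new trainable probe."""
    frozen = {name: p for name, p in params.items() if name.startswith("bert.")}
    probe = {
        f"probe_{task_id}.weight": [0.0] * probe_dim,  # freshly initialized
        f"probe_{task_id}.bias": [0.0],
    }
    return frozen, probe

# Toy parameter dictionary standing in for the pre-trained encoder.
params = {"bert.layer0.weight": [0.1, 0.2], "bert.layer0.bias": [0.0]}
frozen, probe = probe_setup(params, task_id=1)
print(sorted(probe))  # ['probe_1.bias', 'probe_1.weight']
```

Since only the probe is updated, any forgetting observed must come from the probes rather than the fixed encoder representations, which is the point of the study.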
☆ LANDMARK: Language-guided Representation Enhancement Framework for Scene Graph Generation
Scene graph generation (SGG) is a sophisticated task that suffers from both
complex visual features and the dataset long-tail problem. Recently, various
unbiased strategies have been proposed by designing novel loss functions and
data balancing strategies. Unfortunately, these unbiased methods fail to
exploit language priors from a feature-refinement perspective. Inspired by the
fact that predicates are highly correlated with the semantics hidden in
subject-object pairs and the global context, we propose LANDMARK (LANguage-guiDed
representation enhanceMent frAmewoRK), which learns predicate-relevant
representations from language-vision interactive patterns, global language
context and pair-predicate correlation. Specifically, we first project object
labels to three distinctive semantic embeddings for different representation
learning. Then, the Language Attention Module (LAM) and Experience Estimation
Module (EEM) process subject-object word embeddings into an attention vector
and a predicate distribution, respectively. The Language Context Module (LCM)
encodes global context from each word embedding, which avoids isolated learning
from local information. Finally, the modules' outputs are used to update visual
representations and the SGG model's prediction. All language representations are
purely generated from object categories so that no extra knowledge is needed.
This framework is model-agnostic and consistently improves performance on
existing SGG models. Besides, representation-level unbiased strategies endow
LANDMARK with the advantage of being compatible with other methods. Code is available
at https://github.com/rafa-cxg/PySGG-cxg.
comment: Revision period in Applied Intelligence (APIN)
☆ Targeted Adversarial Attacks against Neural Machine Translation ICASSP 2023
Neural Machine Translation (NMT) systems are used in various applications.
However, it has been shown that they are vulnerable to very small perturbations
of their inputs, known as adversarial attacks. In this paper, we propose a new
targeted adversarial attack against NMT models. In particular, our goal is to
insert a predefined target keyword into the translation of the adversarial
sentence while maintaining similarity between the original sentence and the
perturbed one in the source domain. To this aim, we propose an optimization
problem, including an adversarial loss term and a similarity term. We use
gradient projection in the embedding space to craft an adversarial sentence.
Experimental results show that our attack outperforms Seq2Sick, an existing
targeted adversarial attack against NMT models, in terms of success rate and
decrease in translation quality. Our attack succeeds in inserting a keyword
into the translation of more than 75% of sentences while preserving the
similarity to the original sentence.
comment: ICASSP 2023, Code available at:
http://github.com/sssadrizadeh/NMT-targeted-attack
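One projected-gradient step of the kind the abstract describes can be sketched as: take a gradient step on the input embedding to reduce the combined (adversarial + similarity) loss, then project back onto the set of valid token embeddings. The two-dimensional toy vocabulary and gradient below are purely illustrative; the real attack operates on an NMT model's embedding matrix.

```python
# Hedged sketch of gradient projection in embedding space: move the input
# embedding against the loss gradient, then snap to the nearest valid
# token embedding so the result is a real (adversarial) token.

def pgd_step(embedding, grad, vocab_embeddings, step_size=0.5):
    """One gradient step followed by projection onto the token-embedding set."""
    moved = [e - step_size * g for e, g in zip(embedding, grad)]
    # Projection: pick the vocabulary embedding nearest to the moved point.
    def dist2(v):
        return sum((a - b) ** 2 for a, b in zip(moved, v))
    return min(vocab_embeddings, key=dist2)

vocab = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
adv = pgd_step([0.0, 0.0], grad=[-1.6, 0.2], vocab_embeddings=vocab)
print(adv)  # [1.0, 0.0]
```

The combined loss (an adversarial term pushing the target keyword into the translation plus a similarity term keeping the source close to the original) would supply the gradient; here it is just a fixed toy vector.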
☆ Adopting the Multi-answer Questioning Task with an Auxiliary Metric for Extreme Multi-label Text Classification Utilizing the Label Hierarchy
Extreme multi-label text classification utilizes the label hierarchy to
partition extreme labels into multiple label groups, turning the task into
simple multi-group multi-label classification tasks. Current research encodes
labels as fixed-length vectors, which requires establishing multiple
classifiers for different label groups. The problem is how to build only one
classifier without sacrificing the label relationships in the hierarchy. This
paper adopts the multi-answer questioning task for extreme multi-label
classification and also proposes an auxiliary classification evaluation
metric. This study applies the proposed method and evaluation metric to the
legal domain. The utilization of legal BERTs and the study on task
distribution are discussed. Experimental results show that the proposed
hierarchy and multi-answer questioning task can perform extreme multi-label
classification on the EURLEX dataset. When fine-tuning on the multi-label
classification task, the domain-adapted BERT models showed no clear advantage
in this experiment. The method is also theoretically applicable to zero-shot learning.
☆ Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages
Yu Zhang, Wei Han, James Qin, Yongqiang Wang, Ankur Bapna, Zhehuai Chen, Nanxin Chen, Bo Li, Vera Axelrod, Gary Wang, Zhong Meng, Ke Hu, Andrew Rosenberg, Rohit Prabhavalkar, Daniel S. Park, Parisa Haghani, Jason Riesa, Ginger Perng, Hagen Soltau, Trevor Strohman, Bhuvana Ramabhadran, Tara Sainath, Pedro Moreno, Chung-Cheng Chiu, Johan Schalkwyk, Françoise Beaufays, Yonghui Wu
We introduce the Universal Speech Model (USM), a single large model that
performs automatic speech recognition (ASR) across 100+ languages. This is
achieved by pre-training the encoder of the model on a large unlabeled
multilingual dataset of 12 million (M) hours spanning over 300 languages, and
fine-tuning on a smaller labeled dataset. We use multilingual pre-training with
random-projection quantization and speech-text modality matching to achieve
state-of-the-art performance on downstream multilingual ASR and speech-to-text
translation tasks. We also demonstrate that despite using a labeled training
set 1/7-th the size of that used for the Whisper model, our model exhibits
comparable or better performance on both in-domain and out-of-domain speech
recognition tasks across many languages.
comment: 20 pages, 7 figures, 8 tables
☆ Leveraging Large Text Corpora for End-to-End Speech Summarization ICASSP 2023
Kohei Matsuura, Takanori Ashihara, Takafumi Moriya, Tomohiro Tanaka, Atsunori Ogawa, Marc Delcroix, Ryo Masumura
End-to-end speech summarization (E2E SSum) is a technique to directly
generate summary sentences from speech. Compared with the cascade approach,
which combines automatic speech recognition (ASR) and text summarization
models, the E2E approach is more promising because it mitigates ASR errors,
incorporates nonverbal information, and simplifies the overall system. However,
since collecting a large amount of paired data (i.e., speech and summary) is
difficult, the training data is usually insufficient to train a robust E2E SSum
system. In this paper, we present two novel methods that leverage a large
amount of external text summarization data for E2E SSum training. The first
technique is to utilize a text-to-speech (TTS) system to generate synthesized
speech, which is used for E2E SSum training with the text summary. The second
is a TTS-free method that directly inputs phoneme sequence instead of
synthesized speech to the E2E SSum model. Experiments show that our proposed
TTS- and phoneme-based methods improve several metrics on the How2 dataset. In
particular, our best system outperforms a previous state-of-the-art one by a
large margin (i.e., METEOR score improvements of more than 6 points). To the
best of our knowledge, this is the first work to use external language
resources for E2E SSum. Moreover, we report a detailed analysis of the How2
dataset to confirm the validity of our proposed E2E SSum system.
comment: Accepted to ICASSP 2023
☆ Rethinking the Reasonability of the Test Set for Simultaneous Machine Translation ICASSP 2023
Simultaneous machine translation (SimulMT) models start translation before
the end of the source sentence, making the translation monotonically aligned
with the source sentence. However, the general full-sentence translation test
set is acquired by offline translation of the entire source sentence, which is
not designed for SimulMT evaluation, making us rethink whether this will
underestimate the performance of SimulMT models. In this paper, we manually
annotate a monotonic test set based on the MuST-C English-Chinese test set,
denoted as SiMuST-C. Our human evaluation confirms the acceptability of our
annotated test set. Evaluations on three different SimulMT models verify that
the underestimation problem can be alleviated on our test set. Further
experiments show that finetuning on an automatically extracted monotonic
training set improves SimulMT models by up to 3 BLEU points.
comment: Accepted by 48th IEEE International Conference on Acoustics, Speech,
and Signal Processing (ICASSP 2023)
☆ Large-Scale Domain-Specific Pretraining for Biomedical Vision-Language Processing
Sheng Zhang, Yanbo Xu, Naoto Usuyama, Jaspreet Bagga, Robert Tinn, Sam Preston, Rajesh Rao, Mu Wei, Naveen Valluri, Cliff Wong, Matthew P. Lungren, Tristan Naumann, Hoifung Poon
Contrastive pretraining on parallel image-text data has attained great
success in vision-language processing (VLP), as exemplified by CLIP and related
methods. However, prior explorations have tended to focus on general web
domains. Biomedical images and text are rather different, but publicly
available datasets are small and skewed toward chest X-rays, thus severely
limiting progress. In this paper, we conducted by far the largest study on biomedical
VLP, using 15 million figure-caption pairs extracted from biomedical research
articles in PubMed Central. Our dataset (PMC-15M) is two orders of magnitude
larger than existing biomedical image-text datasets such as MIMIC-CXR, and
spans a diverse range of biomedical images. The standard CLIP method is
suboptimal for the biomedical domain. We propose BiomedCLIP with
domain-specific adaptations tailored to biomedical VLP. We conducted extensive
experiments and ablation studies on standard biomedical imaging tasks from
retrieval to classification to visual question-answering (VQA). BiomedCLIP
established a new state of the art on a wide range of standard datasets,
substantially outperforming prior VLP approaches. Surprisingly, BiomedCLIP even
outperformed radiology-specific state-of-the-art models such as BioViL on
radiology-specific tasks such as RSNA pneumonia detection, thus highlighting
the utility of large-scale pretraining across all biomedical image types. We
will release our models at https://aka.ms/biomedclip to facilitate future
research in biomedical VLP.
comment: The models will be released soon at https://aka.ms/biomedclip
☆ Interactive Text Generation
Felix Faltings, Michel Galley, Baolin Peng, Kianté Brantley, Weixin Cai, Yizhe Zhang, Jianfeng Gao, Bill Dolan
Users interact with text, image, code, or other editors on a daily basis.
However, machine learning models are rarely trained in the settings that
reflect the interactivity between users and their editor. This is
understandable as training AI models with real users is not only slow and
costly, but what these models learn may be specific to user interface design
choices. Unfortunately, this means most of the research on text, code, and
image generation has focused on non-interactive settings, whereby the model is
expected to get everything right without accounting for any input from a user
who may be willing to help.
We introduce a new Interactive Text Generation task that allows training
generation models interactively without the costs of involving real users, by
using user simulators that provide edits that guide the model towards a given
target text. We train our interactive models using Imitation Learning, and our
experiments against competitive non-interactive generation models show that
models trained interactively are superior to their non-interactive
counterparts, even when all models are given the same budget of user inputs or
edits.
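The setup can be pictured with a toy rule-based user simulator that proposes one token-level edit per turn, nudging the model's draft toward the target text. The heuristic and names below are hypothetical illustrations, not the paper's actual simulator:

```python
def simulate_user_edit(draft, target):
    """Toy user simulator: propose a single token-level edit that moves
    the draft one step closer to the target (hypothetical heuristic)."""
    d, t = draft.split(), target.split()
    for i in range(min(len(d), len(t))):
        if d[i] != t[i]:
            return ("replace", i, t[i])  # fix the first mismatching token
    if len(d) < len(t):
        return ("insert", len(d), t[len(d)])  # draft is a prefix: extend it
    if len(d) > len(t):
        return ("delete", len(t), None)  # draft is too long: trim it
    return None  # draft already matches the target

def apply_edit(draft, edit):
    tokens = draft.split()
    op, i, tok = edit
    if op == "replace":
        tokens[i] = tok
    elif op == "insert":
        tokens.insert(i, tok)
    else:
        del tokens[i]
    return " ".join(tokens)

draft, target = "the cat sat", "the dog sat down"
while (edit := simulate_user_edit(draft, target)) is not None:
    draft = apply_edit(draft, edit)
print(draft)  # the dog sat down
```

In the actual task, the model would consume each simulated edit as an observation and regenerate, rather than applying edits mechanically as here.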
☆ Stochastic Clustered Federated Learning
Federated learning is a distributed learning framework that takes full
advantage of private data samples kept on edge devices. In real-world federated
learning systems, these data samples are often decentralized and
Non-Independently and Identically Distributed (Non-IID), causing divergence and
performance degradation in the federated learning process. As a new solution,
clustered federated learning groups federated clients with similar data
distributions to mitigate Non-IID effects and train a better model for every
cluster. This paper proposes StoCFL, a novel clustered federated learning
approach for generic Non-IID issues. Specifically, StoCFL implements a flexible
CFL framework that supports an arbitrary proportion of client participation and
newly joined clients in a varying FL system, while still delivering a
substantial improvement in model performance. We conduct extensive experiments
using four basic Non-IID settings and a real-world dataset. The results show
that StoCFL obtains promising clustering results even when the number of
clusters is unknown. Based on the client clustering results, models trained
with StoCFL outperform baseline approaches in a variety of contexts.
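One way to picture clustered federated learning is grouping clients by the similarity of their model updates, without fixing the number of clusters in advance. The greedy cosine-similarity scheme below is an illustrative stand-in, not StoCFL's actual algorithm, and the threshold is arbitrary:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def cluster_clients(updates, threshold=0.5):
    """Greedy clustering of client model updates by cosine similarity
    (an illustrative stand-in for StoCFL's clustering). Newly joined
    clients simply fall into the closest existing cluster or open a
    new one, so the number of clusters need not be known upfront."""
    clusters = []  # each cluster: list of (client_id, update) pairs
    for cid, u in updates.items():
        best, best_sim = None, threshold
        for c in clusters:
            sim = cosine(u, c[0][1])  # compare to the cluster's anchor
            if sim > best_sim:
                best, best_sim = c, sim
        if best is None:
            clusters.append([(cid, u)])
        else:
            best.append((cid, u))
    return [[cid for cid, _ in c] for c in clusters]

updates = {
    "a": [1.0, 0.1], "b": [0.9, 0.2],   # one data distribution
    "c": [-0.1, 1.0], "d": [0.0, 0.9],  # a second distribution
}
print(cluster_clients(updates))  # [['a', 'b'], ['c', 'd']]
```

Each cluster would then train its own model in the federated rounds.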
♻ ☆ Let's have a chat! A Conversation with ChatGPT: Technology, Applications, and Limitations
The emergence of an AI-powered chatbot that can generate human-like sentences
and write coherent essays has caught the world's attention. This paper
discusses the historical overview of chatbots and the technology behind Chat
Generative Pre-trained Transformer, better known as ChatGPT. Moreover,
potential applications of ChatGPT in various domains, including healthcare,
education, and research, are highlighted. Despite promising results, there are
several privacy and ethical concerns surrounding ChatGPT. In addition, we
highlight some of the important limitations of the current version of ChatGPT.
We also ask ChatGPT to provide its point of view and present its responses to
several questions we attempt to answer.
♻ ☆ Small-Text: Active Learning for Text Classification in Python EACL 2023
We introduce small-text, an easy-to-use active learning library, which offers
pool-based active learning for single- and multi-label text classification in
Python. It features numerous pre-implemented state-of-the-art query strategies,
including some that leverage the GPU. Standardized interfaces allow the
combination of a variety of classifiers, query strategies, and stopping
criteria, facilitating a quick mix and match, and enabling a rapid and
convenient development of both active learning experiments and applications.
With the objective of making various classifiers and query strategies
accessible for active learning, small-text integrates several well-known
machine learning libraries, namely scikit-learn, PyTorch, and Hugging Face
transformers. The latter integrations are optionally installable extensions, so
GPUs can be used but are not required. Using this new library, we investigate
the performance of the recently published SetFit training paradigm, which we
compare to vanilla transformer fine-tuning, finding that it matches the latter
in classification accuracy while outperforming it in area under the curve. The
library is available under the MIT License at
https://github.com/webis-de/small-text, in version 1.3.0 at the time of
writing.
comment: EACL 2023 System Demonstrations (camera-ready)
♻ ☆ ferret: a Framework for Benchmarking Explainers on Transformers EACL 2023
As Transformers are increasingly relied upon to solve complex NLP problems,
there is an increasing need for their decisions to be interpretable by humans.
While several explainable AI (XAI) techniques for interpreting the outputs of
transformer-based models have been proposed, there is still a lack of easy
access to using and comparing them. We introduce ferret, a Python library to
simplify the use and comparisons of XAI methods on transformer-based
classifiers. With ferret, users can visualize and compare explanations of
transformer-based models' outputs using state-of-the-art XAI methods, on any
free text or on existing XAI corpora. Moreover, users can also evaluate ad hoc XAI metrics
to select the most faithful and plausible explanations. To align with the
recently consolidated process of sharing and using transformers-based models
from Hugging Face, ferret interfaces directly with its Python library. In this
paper, we showcase ferret to benchmark XAI methods used on transformers for
sentiment analysis and hate speech detection. We show how specific methods
provide consistently better explanations and are preferable in the context of
transformer models.
comment: 11 pages, 3 figures. Accepted to EACL 2023 (System Demonstration).
More details at https://github.com/g8a9/ferret
♻ ☆ Can ChatGPT Understand Too? A Comparative Study on ChatGPT and Fine-tuned BERT
Recently, ChatGPT has attracted great attention, as it can generate fluent
and high-quality responses to human inquiries. Several prior studies have shown
that ChatGPT attains remarkable generation ability compared with existing
models. However, the quantitative analysis of ChatGPT's understanding ability
has been given little attention. In this report, we explore the understanding
ability of ChatGPT by evaluating it on the most popular GLUE benchmark, and
comparing it with 4 representative fine-tuned BERT-style models. We find that:
1) ChatGPT falls short in handling paraphrase and similarity tasks; 2) ChatGPT
outperforms all BERT models on inference tasks by a large margin; 3) ChatGPT
achieves performance comparable to BERT on sentiment analysis and
question-answering tasks. Additionally, by combining some advanced prompting
strategies, we show that the understanding ability of ChatGPT can be further
improved.
comment: Work in progress. Added results of advanced prompting strategies,
e.g., CoT. (19 pages)
♻ ☆ Like a Good Nearest Neighbor: Practical Content Moderation with Sentence Transformers
Modern text classification systems have impressive capabilities but are
infeasible to deploy and use reliably due to their dependence on prompting and
billion-parameter language models. SetFit (Tunstall et al., 2022) is a recent,
practical approach that fine-tunes a Sentence Transformer under a contrastive
learning paradigm and achieves similar results to more unwieldy systems. Text
classification is important for addressing the problem of domain drift in
detecting harmful content, which plagues all social media platforms. Here, we
propose Like a Good Nearest Neighbor (LaGoNN), an inexpensive modification to
SetFit that requires no additional parameters or hyperparameters but augments
the input with information about its nearest neighbor in the training data (for
example, that neighbor's label and text), making novel data appear similar to
an instance on which the model was optimized. LaGoNN is effective at detecting
harmful content and generally improves performance compared to SetFit. To
demonstrate the value of our system, we conduct a thorough study of text
classification systems in the context of content moderation under four label
distributions.
comment: 8 pages, 4 figures, 13 supplemental pages, 15 supplemental figures
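The core idea, appending the nearest training neighbor's label and text to the input, can be sketched as follows. The bag-of-words embedding below stands in for a Sentence Transformer, and the `[SEP]` template is illustrative rather than the paper's exact format:

```python
from collections import Counter
import math

def embed(text):
    """Toy bag-of-words embedding standing in for a Sentence Transformer."""
    return Counter(text.lower().split())

def cosine(u, v):
    dot = sum(u[w] * v[w] for w in u)
    return dot / (math.sqrt(sum(c * c for c in u.values())) *
                  math.sqrt(sum(c * c for c in v.values())))

def lagonn_input(text, train):
    """Append the nearest training neighbor's label and text to the input,
    in the spirit of LaGoNN (illustrative formatting, not the paper's
    exact template)."""
    q = embed(text)
    best = max(train, key=lambda ex: cosine(q, embed(ex["text"])))
    return f'{text} [SEP] {best["label"]}: {best["text"]}'

train = [
    {"text": "you are a wonderful person", "label": "benign"},
    {"text": "i hope something bad happens to you", "label": "harmful"},
]
print(lagonn_input("something bad will happen to you", train))
```

The modified input is then classified by the unchanged SetFit model, which is why no extra parameters are needed.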
♻ ☆ YATO: Yet Another deep learning based Text analysis Open toolkit
We introduce YATO, an open-source toolkit for text analysis with deep
learning. It focuses on fundamental sequence labeling and sequence
classification tasks on text. Designed in a hierarchical structure, YATO
supports free combinations of three types of features including 1) traditional
neural networks (CNN, RNN, etc.); 2) pre-trained language models (BERT,
RoBERTa, ELECTRA, etc.); and 3) user-customized neural features via a simple
configuration file. Thanks to its flexibility and ease of use, YATO
facilitates the reproduction and refinement of state-of-the-art NLP models and
promotes cross-disciplinary applications of NLP techniques.
Source code, examples, and documentation are publicly available at
https://github.com/jiesutd/YATO. A demo video is also available at
https://youtu.be/tSjjf5BzfQg.
♻ ☆ The Dialog Must Go On: Improving Visual Dialog via Generative Self-Training CVPR 2023
Visual dialog (VisDial) is a task of answering a sequence of questions
grounded in an image, using the dialog history as context. Prior work has
trained the dialog agents solely on VisDial data via supervised learning or
leveraged pre-training on related vision-and-language datasets. This paper
presents a semi-supervised learning approach for visually-grounded dialog,
called Generative Self-Training (GST), to leverage unlabeled images on the Web.
Specifically, GST first retrieves in-domain images through out-of-distribution
detection and generates synthetic dialogs regarding the images via multimodal
conditional text generation. GST then trains a dialog agent on the synthetic
and the original VisDial data. As a result, GST scales the amount of training
data by an order of magnitude over VisDial (from 1.2M to 12.9M QA pairs). For
robust training of the synthetic dialogs, we also propose perplexity-based data
selection and multimodal consistency regularization. Evaluation on VisDial v1.0
and v0.9 datasets shows that GST achieves new state-of-the-art results on both
datasets. We further observe the robustness of GST against both visual and
textual adversarial attacks. Finally, GST yields strong performance gains in
the low-data regime. Code is available at
https://github.com/gicheonkang/gst-visdial.
comment: CVPR 2023
♻ ☆ Internal Language Model Estimation based Adaptive Language Model Fusion for Domain Adaptation ICASSP 2023
The deployment environment of an ASR model is ever-changing, and the incoming
speech can switch across different domains during a session. This poses a
challenge for effective domain adaptation when only target-domain text data is
available; our objective is to obtain clearly improved performance on the
target domain while degrading performance on the general domain as little as possible.
In this paper, we propose an adaptive LM fusion approach called internal
language model estimation based adaptive domain adaptation (ILME-ADA). To
realize such an ILME-ADA, an interpolated log-likelihood score is calculated
based on the maximum of the scores from the internal LM and the external LM
(ELM) respectively. We demonstrate the efficacy of the proposed ILME-ADA method
with both RNN-T and LAS modeling frameworks employing neural network and n-gram
LMs as ELMs respectively on two domain specific (target) test sets. The
proposed method can achieve significantly better performance on the target test
sets while it gets minimal performance degradation on the general test set,
compared with both shallow and ILME-based LM fusion methods.
comment: Accepted by ICASSP 2023
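One plausible reading of the fusion rule, taking the maximum of the internal-LM and external-LM scores inside an interpolated log-likelihood, can be sketched as below. The weights and the exact combination are assumptions, not the paper's formula:

```python
def ilme_ada_score(asr_logp, ilm_logp, elm_logp,
                   ilm_weight=0.3, elm_weight=0.5):
    """Adaptive LM fusion score sketch: subtract a weighted internal-LM
    estimate and add an LM term gated by max(ILM, ELM), following one
    plausible reading of the abstract (weights are hypothetical)."""
    adaptive_lm = max(ilm_logp, elm_logp)
    return asr_logp - ilm_weight * ilm_logp + elm_weight * adaptive_lm

# A target-domain token scores well under the external (target-domain) LM:
in_domain = ilme_ada_score(-2.0, ilm_logp=-5.0, elm_logp=-1.5)
# A token only the internal (source-domain) LM likes gets less of a boost:
general = ilme_ada_score(-2.0, ilm_logp=-1.5, elm_logp=-5.0)
print(in_domain > general)  # True: target-domain hypotheses are favored
```

The max acts as a soft selector between the two LMs per hypothesis, which is how one could limit degradation on general-domain speech.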
♻ ☆ TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation ICLR 2023
Direct speech-to-speech translation (S2ST) with discrete units leverages
recent progress in speech representation learning. Specifically, a sequence of
discrete representations, derived in a self-supervised manner, is predicted by
the model and passed to a vocoder for speech reconstruction. This approach
still faces the following challenges: 1) acoustic multimodality: the discrete
units derived from speech with the same content can be nondeterministic due to
acoustic properties (e.g., rhythm, pitch, and energy), which degrades
translation accuracy; 2) high latency: current S2ST systems use autoregressive
models that predict each unit conditioned on the previously generated sequence,
failing to take full advantage of parallelism. In this
work, we propose TranSpeech, a speech-to-speech translation model with
bilateral perturbation. To alleviate the acoustic multimodal problem, we
propose bilateral perturbation (BiP), which consists of the style normalization
and information enhancement stages, to learn only the linguistic information
from speech samples and generate more deterministic representations. With
reduced multimodality, we step forward and become the first to establish a
non-autoregressive S2ST technique, which repeatedly masks and predicts unit
choices and produces high-accuracy results in just a few cycles. Experimental
results on three language pairs demonstrate that BiP yields an improvement of
2.9 BLEU on average compared with a baseline textless S2ST model. Moreover, our
parallel decoding significantly reduces inference latency, enabling a speedup
of up to 21.4x over the autoregressive technique. Audio samples are available
at \url{https://TranSpeech.github.io/}
comment: Accepted to ICLR 2023
♻ ☆ Factuality Enhanced Language Models for Open-Ended Text Generation NeurIPS 2022
Pretrained language models (LMs) are prone to generating text with
nonfactual information. In this work, we measure and improve the factual
accuracy of large-scale LMs for open-ended text generation. We design the
FactualityPrompts test set and metrics to measure the factuality of LM
generations. Based on that, we study the factual accuracy of LMs with parameter
sizes ranging from 126M to 530B. Interestingly, we find that larger LMs are
more factual than smaller ones, although a previous study suggests that larger
LMs can be less truthful in terms of misconceptions. In addition, popular
sampling algorithms (e.g., top-p) in open-ended text generation can harm the
factuality due to the ''uniform randomness'' introduced at every sampling step.
We propose the factual-nucleus sampling algorithm that dynamically adapts the
randomness to improve the factuality of generation while maintaining quality.
Furthermore, we analyze the inefficiencies of the standard training method in
learning correct associations between entities from factual text corpora (e.g.,
Wikipedia). We propose a factuality-enhanced training method that uses
TopicPrefix for better awareness of facts and sentence completion as the
training objective, which can vastly reduce the factual errors. We release our
code and FactualityPrompts benchmark at:
https://github.com/nayeon7lee/FactualityPrompt.
comment: NeurIPS 2022
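A common formulation of such adaptive sampling decays the nucleus threshold p within a sentence and resets it at sentence boundaries, clipping at a lower bound so generation stays diverse. The sketch below uses illustrative parameter values and a 0-based within-sentence index, which may differ from the paper's exact schedule:

```python
def factual_nucleus_p(token_index_in_sentence, p=0.9, decay=0.9, floor=0.3):
    """Nucleus threshold sketch for factual-nucleus sampling: decay p as
    the sentence progresses and clip at a lower bound. The threshold
    would reset to p at the start of each new sentence. Parameter
    values here are illustrative, not the paper's."""
    return max(floor, p * decay ** token_index_in_sentence)

# Threshold schedule for the first 12 tokens of a sentence:
thresholds = [round(factual_nucleus_p(t), 3) for t in range(12)]
print(thresholds)
```

Early tokens keep full nucleus randomness, while later tokens (where factual details such as entity names tend to appear) are sampled more conservatively.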
♻ ☆ Viterbi Decoding of Directed Acyclic Transformer for Non-Autoregressive Machine Translation EMNLP 2022
Non-autoregressive models achieve significant decoding speedup in neural
machine translation but lack the ability to capture sequential dependency.
Directed Acyclic Transformer (DA-Transformer) was recently proposed to model
sequential dependency with a directed acyclic graph. Consequently, it has to
apply a sequential decision process at inference time, which harms the global
translation accuracy. In this paper, we present a Viterbi decoding framework
for DA-Transformer, which is guaranteed to find the jointly optimal
translation and decoding path under any length constraint. Experimental results
demonstrate that our approach consistently improves the performance of
DA-Transformer while maintaining a similar decoding speedup.
comment: Findings of EMNLP 2022
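Generic Viterbi decoding over a DAG under a length constraint can be sketched as a dynamic program over (vertex, emitted-length) states. This is an illustrative simplification, not the paper's exact algorithm, which jointly scores token choices and transitions in the DA-Transformer graph:

```python
def viterbi_dag(emit, trans, target_len):
    """Best path in a DAG with a log-prob emission score per vertex and
    transition scores on edges, constrained to visit exactly `target_len`
    vertices (generic sketch). Vertices are 0..n-1 in topological order;
    paths run from vertex 0 to vertex n-1, and edges only go forward."""
    n = len(emit)
    NEG = float("-inf")
    # best[v][k]: best score of a path reaching v after emitting k tokens
    best = [[NEG] * (target_len + 1) for _ in range(n)]
    back = [[None] * (target_len + 1) for _ in range(n)]
    best[0][1] = emit[0]
    for v in range(n):  # topological order keeps best[v][k] final here
        for k in range(1, target_len):
            if best[v][k] == NEG:
                continue
            for u, t_score in trans[v]:
                cand = best[v][k] + t_score + emit[u]
                if cand > best[u][k + 1]:
                    best[u][k + 1] = cand
                    back[u][k + 1] = v
    # recover the best path of exactly target_len vertices ending at n-1
    path, v, k = [], n - 1, target_len
    while v is not None:
        path.append(v)
        v, k = back[v][k], k - 1
    return best[n - 1][target_len], path[::-1]

emit = [0.0, -0.5, -0.1, 0.0]          # log-prob of the best token per vertex
trans = {0: [(1, -0.2), (2, -0.1)],    # forward edges with transition scores
         1: [(3, -0.1)], 2: [(3, -0.3)], 3: []}
score, path = viterbi_dag(emit, trans, target_len=3)
print(path)  # [0, 2, 3]
```

Because every (vertex, length) state is expanded exactly once, the search is exhaustive over paths of the given length, which is what makes the joint optimum guaranteed.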
♻ ☆ On the Robustness of ChatGPT: An Adversarial and Out-of-distribution Perspective
Jindong Wang, Xixu Hu, Wenxin Hou, Hao Chen, Runkai Zheng, Yidong Wang, Linyi Yang, Haojun Huang, Wei Ye, Xiubo Geng, Binxin Jiao, Yue Zhang, Xing Xie
ChatGPT is a recent chatbot service released by OpenAI that has received
increasing attention over the past few months. While various aspects of
ChatGPT have been evaluated, its robustness, i.e., its performance on
unexpected inputs, remains unclear to the public. Robustness is of particular
concern in responsible AI, especially for safety-critical applications. In this
paper, we conduct a thorough evaluation of the robustness of ChatGPT from the
adversarial and out-of-distribution (OOD) perspective. To do so, we employ the
AdvGLUE and ANLI benchmarks to assess adversarial robustness and the Flipkart
review and DDXPlus medical diagnosis datasets for OOD evaluation. We select
several popular foundation models as baselines. Results show that ChatGPT exhibits
consistent advantages on most adversarial and OOD classification and
translation tasks. However, its absolute performance is far from perfect,
which suggests that adversarial and OOD robustness remains a significant threat
to foundation models. Moreover, ChatGPT shows astounding performance in
understanding dialogue-related texts and we find that it tends to provide
informal suggestions for medical tasks instead of definitive answers. Finally,
we present in-depth discussions of possible research directions.
comment: Technical report; code is at:
https://github.com/microsoft/robustlearn
♻ ☆ Dynamic Prompt Learning via Policy Gradient for Semi-structured Mathematical Reasoning ICLR 2023
Pan Lu, Liang Qiu, Kai-Wei Chang, Ying Nian Wu, Song-Chun Zhu, Tanmay Rajpurohit, Peter Clark, Ashwin Kalyan
Mathematical reasoning, a core ability of human intelligence, presents unique
challenges for machines in abstract thinking and logical reasoning. Recent
large pre-trained language models such as GPT-3 have achieved remarkable
progress on mathematical reasoning tasks written in text form, such as math
word problems (MWP). However, it is unknown if the models can handle more
complex problems that involve math reasoning over heterogeneous information,
such as tabular data. To fill the gap, we present Tabular Math Word Problems
(TabMWP), a new dataset containing 38,431 open-domain grade-level problems that
require mathematical reasoning on both textual and tabular data. Each question
in TabMWP is aligned with a tabular context, which is presented as an image,
semi-structured text, and a structured table. There are two types of questions:
free-text and multi-choice, and each problem is annotated with gold solutions
to reveal the multi-step reasoning process. We evaluate different pre-trained
models on TabMWP, including the GPT-3 model in a few-shot setting. As earlier
studies suggest, because few-shot GPT-3 relies on the selection of in-context
examples, its performance is unstable and can degrade to near chance. This
instability is more severe when handling complex problems like those in TabMWP. To
mitigate this, we further propose a novel approach, PromptPG, which utilizes
policy gradient to learn to select in-context examples from a small amount of
training data and then constructs the corresponding prompt for the test
example. Experimental results show that our method outperforms the best
baseline by 5.31% on the accuracy metric and reduces the prediction variance
significantly compared to random selection, which verifies its effectiveness in
selecting in-context examples.
comment: ICLR 2023. 26 pages and 18 figures. The data and code are available
at https://promptpg.github.io
♻ ☆ TextWorldExpress: Simulating Text Games at One Million Steps Per Second EACL 2023
Text-based games offer a challenging test bed to evaluate virtual agents at
language understanding, multi-step problem-solving, and common-sense reasoning.
However, speed is a major limitation of current text-based games, which cap at
around 300 steps per second, mainly due to the use of legacy tooling. In this
work we present TextWorldExpress, a high-performance simulator that includes
implementations of three common text-game benchmarks and increases simulation
throughput by approximately three orders of magnitude, reaching over one
million steps per second on common desktop hardware. This significantly reduces
experiment runtime, enabling billion-step-scale experiments in about one day.
comment: Accepted to EACL 2023
♻ ☆ Language Models Are Greedy Reasoners: A Systematic Formal Analysis of Chain-of-Thought ICLR 2023
Large language models (LLMs) have shown remarkable reasoning capabilities
given chain-of-thought prompts (examples with intermediate reasoning steps).
Existing benchmarks measure reasoning ability indirectly, by evaluating
accuracy on downstream tasks such as mathematical reasoning. However, it is
unclear how these models obtain the answers and whether they rely on simple
heuristics rather than the generated chain-of-thought. To enable systematic
exploration of the reasoning ability of LLMs, we present a new synthetic
question-answering dataset called PrOntoQA, where each example is generated
from a synthetic world model represented in first-order logic. This allows us
to parse the generated chain-of-thought into symbolic proofs for formal
analysis. Our analysis on InstructGPT and GPT-3 shows that LLMs are quite
capable of making correct individual deduction steps, and so are generally
capable of reasoning, even in fictional contexts. However, they have difficulty
with proof planning: When multiple valid deduction steps are available, they
are not able to systematically explore the different options.
comment: Published as a conference paper at ICLR 2023
♻ ☆ Learning to Locate Visual Answer in Video Corpus Using Question ICASSP 2023
We introduce a new task, named video corpus visual answer localization
(VCVAL), which aims to locate the visual answer in a large collection of
untrimmed instructional videos using a natural language question. This task
requires a range of skills - the interaction between vision and language, video
retrieval, passage comprehension, and visual answer localization. In this
paper, we propose a cross-modal contrastive global-span (CCGS) method for the
VCVAL, jointly training the video corpus retrieval and visual answer
localization subtasks with the global-span matrix. We have reconstructed a
dataset named MedVidCQA, on which the VCVAL task is benchmarked. Experimental
results show that the proposed method outperforms other competitive methods
both in the video corpus retrieval and visual answer localization subtasks.
Most importantly, we perform detailed analyses of extensive experiments, paving
a new path for understanding instructional videos and opening avenues for
further research.
comment: Accepted by ICASSP 2023
♻ ☆ A Zipf's Law-Driven Method for Extracting Entities from Documents
Entity extraction is critical to the intelligent development of various
domains and the construction of knowledge agents. Yet documents in some
specific domains suffer from a category imbalance problem: some categories of
entities are common, while others are rare and scattered. This paper proposes
to use Zipf's law to tackle this problem and to improve the performance of
entity extraction from documents. Using two forms of Zipf's law, words in the
documents are classified as common or rare; sentences are then classified as
common or rare accordingly, and are further processed by separate text
generation models. Rare entities in the generated sentences are
labeled with human-designed rules, and serve as a supplement to the raw dataset
so as to alleviate the category imbalance problem. A case study of extracting
entities from technical documents on industrial safety is given, and the
experimental results on two datasets show the effectiveness of the proposed
method.
comment: Journal of Informetrics
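The rank-frequency split at the heart of the method can be sketched as follows. The rank cutoff and the rule that a sentence is rare if it contains any rare word are illustrative assumptions, not the paper's exact criteria:

```python
from collections import Counter

def split_common_rare(sentences, rank_cutoff=3):
    """Classify words as common or rare by Zipf rank-frequency, then flag
    sentences containing any rare word (an illustrative reading of the
    method; the cutoff is hypothetical)."""
    counts = Counter(w for s in sentences for w in s.lower().split())
    ranked = [w for w, _ in counts.most_common()]
    common_words = set(ranked[:rank_cutoff])
    rare_sents = [s for s in sentences
                  if any(w not in common_words for w in s.lower().split())]
    return common_words, rare_sents

sents = ["the valve leaked", "the valve failed", "the flange corroded"]
common, rare = split_common_rare(sents)
print(sorted(common), rare)
```

The rare sentences would then be fed to a text generation model to synthesize additional examples of the scarce entity categories.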